Japanese Unknown Word Identification by Character-based Chunking
Authors
Abstract
We introduce a character-based chunking method for unknown word identification in Japanese text. A major advantage of the method is its ability to detect low-frequency unknown words with unrestricted character type patterns. The method is built on SVM-based chunking and uses as features character n-grams and the surrounding context of n-best word segmentation candidates produced by statistical morphological analysis. Applied to newspaper and patent texts, it achieves 95% precision and 55-70% recall for newspapers and more than 85% precision for patent texts.
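The character-level view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the B/I/O tag set, the window size, the coarse character classes, and the function names `char_type`, `chunk_tags`, and `char_features` are all assumptions made for illustration, and the SVM classifier and the n-best segmentation features are omitted.

```python
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode name (illustrative choice)."""
    name = unicodedata.name(ch, "")
    # Check katakana first: the prolonged sound mark's name mentions both scripts.
    if "KATAKANA" in name:
        return "katakana"
    if "HIRAGANA" in name:
        return "hiragana"
    if "CJK UNIFIED" in name:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def chunk_tags(words):
    """Turn (surface, is_unknown) pairs into one tag per character.

    B = first character of an unknown word, I = later character of an
    unknown word, O = character of a known word. (Assumed tag set.)
    """
    tags = []
    for surface, is_unknown in words:
        for i in range(len(surface)):
            if is_unknown:
                tags.append("B" if i == 0 else "I")
            else:
                tags.append("O")
    return tags


def char_features(text, i, window=2):
    """Characters and character types in a window around position i."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(text):
            feats[f"char[{off}]"] = text[j]
            feats[f"type[{off}]"] = char_type(text[j])
        else:
            feats[f"char[{off}]"] = "<pad>"
            feats[f"type[{off}]"] = "pad"
    return feats


# Hypothetical example: the last word is marked unknown by hand.
words = [("東京", False), ("で", False), ("ググる", True)]
text = "".join(surface for surface, _ in words)
tags = chunk_tags(words)
features = [char_features(text, i) for i in range(len(text))]
```

Each `(features[i], tags[i])` pair would be one training instance for a per-character classifier; a chunker then reads unknown words off as maximal B I ... I runs.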
Related resources
Corpus-based Japanese morphological analysis
The goal of this study is to improve corpus-based Japanese morphological analysis, which consists of word segmentation and part-of-speech (hereafter POS) tagging. We divide the problem of Japanese morphological analysis into three subproblems: models for known words, models for unknown words, and a corpus maintenance scheme. First, we discuss Markov model-based approaches for known word processing. ...
Chinese Unknown Word Identification Using Character-based Tagging and Chunking
Since written Chinese has no spaces to delimit words, segmenting Chinese text is an essential task. During this task, the problem of unknown words arises: it is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution to detect unknown words in Chinese texts. First, a morphological analysis is done to obtain i...
Chunking-based Chinese Word Tokenization
This paper introduces a Chinese word tokenization system through HMM-based chunking. Experiments show that such a system can deal well with the unknown word problem in Chinese word tokenization. The second term in (2-1) is the mutual information between T and. In order to simplify the computation of this term, we assume mutual information independence (2-2): 1 1 log...
A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context
We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The point is quite simple: different character sets should be treated differently and the changes between character types are very important because Japanese script has both ideograms like Chinese (kanji) and phonograms like English...
Hierarchical Word Structure-based Parsing: A Feasibility Study on UD-style Dependency Parsing in Japanese
In applying word-based dependency parsing such as Universal Dependencies (UD) to Japanese, uncertainty in word segmentation arises when defining the word unit of the dependencies. We introduce the following hierarchical word structures to dependency parsing in Japanese: morphological units (a short unit word, SUW) and syntactic units (a long unit word, LUW). This paper describes the results o...